AN OPERAND CACHE FOR REDUCING LATENCY OF
READ INSTRUCTIONS

BACKGROUND OF THE INVENTION 

The present invention relates generally to processor design and
more specifically to techniques for caching the values of instruction
operands stored in memory.

Various techniques in the field of computer architecture have
been developed for increasing processor performance beyond what can be
achieved solely via process or circuit design improvements. One such
technique is pipelining. Pipelining was extensively examined in "The
Architecture of Pipelined Computers," by Peter M. Kogge (McGraw-Hill, 1981).
J.L. Hennessy and D.A. Patterson provide a contemporary discussion of
pipelining in chapter 6 of "Computer Architecture, A Quantitative Approach"
(Morgan Kaufmann, 1990).

Pipelined processors decompose the execution of instructions into
multiple successive stages, such as fetch, decode, and execute. Each stage
of execution is designed to perform its work within the processor's basic
machine cycle. Hardware is dedicated to performing the work defined by each
stage. As the number of stages is increased, while keeping the work done by
the instruction constant, the processor is said to be more heavily
pipelined. Each instruction progresses from stage to stage, ideally with
another instruction progressing in lockstep only one stage behind. Thus,
there can be as many instructions in execution as there are pipeline
stages. The major attribute of a pipelined processor is that a throughput
of one instruction per cycle can be obtained, though when viewed in
isolation, each instruction requires as many cycles to perform as there are
pipeline stages.

The ability to increase throughput via pipelining is limited by
situations called pipeline hazards. Hazards may be caused by, among other
things, data dependencies that arise due to the overlapping stages of
instruction processing inherent in the pipeline technique. One type of data
dependency that frequently arises is associated with an instruction that
retrieves an operand from memory into a register. Later instructions that
have progressed to a pipeline stage in which an operation using the value
stored in that register is to be performed, and instructions depending on
the results of operations that use the value stored in the register, must be
stalled until the operand is retrieved from memory, i.e. until after the
physical memory address of the operand is determined and the operand is
retrieved from the memory (either from a unified or data cache, or in the
worst case from a performance point of view, from main memory).

In other words, the inter-stage advance of instructions might
have to be stalled until the required operand is retrieved. Otherwise,
improper operation would result. To prevent such incorrect behavior,
"interlock" logic is added to detect this hazard and invoke the pipeline
stall. While the pipeline is stalled, there are stages in the pipeline that
are not doing any useful work. Since this absence of work propagates from
stage to stage, the term pipeline bubble is also used to describe this
condition. The throughput of the processor suffers whenever such bubbles
occur.

Many potential stalls resulting from data hazards can sometimes
be avoided if a program is compiled initially or later recompiled using an
optimizing compiler that rearranges program instructions in a manner that is
custom tailored to the microarchitecture of the processor. Such optimizing
compilers are relatively new, have restricted availability, and do not
benefit programs that are already in the field. Rearranging instructions
using an optimizing compiler is referred to as static instruction
scheduling. The Intel Pentium™ Processor is an example of a processor that
relies on static instruction scheduling to achieve its full promised
performance.

In contrast to static techniques, dynamic instruction scheduling
techniques act to rearrange the program instructions at the time the program
is running. Dynamic scheduling does not require the use of an optimizing
compiler and thus benefits all programs, both new and existing. One dynamic
instruction scheduling technique involves the use of largely autonomous
execution units that can queue up operations and execute them out of order.
One such system is described in U.S. Patent No. 5,226,126, ('126) PROCESSOR
HAVING PLURALITY OF FUNCTIONAL UNITS FOR ORDERLY RETIRING OUTSTANDING
OPERATIONS BASED UPON ITS ASSOCIATED TAGS, to McFarland et al., issued July
6, 1993, which is assigned to the assignee of the present invention, and
hereby incorporated by reference for all purposes.

In the '126 processor a decoder issues operations simultaneously
to all of the execution units, each of which queues up only the operations
that require its services. During each cycle, each execution unit can
service any operation from its queue not subject to an interlock, i.e. can
execute operations out-of-order. Thus, despite the fact that some
operations queued in the execution units may be subject to an interlock due
to data (and other types of) hazards, there is a greater chance during any
given cycle that an execution unit has useful work to perform.

Out-of-order execution tends to localize the effects of
dependencies to a single execution unit. Because of their loose coupling
and independent execution, stalls that affect only one execution unit can be
effectively absorbed when that unit is later able to proceed past another
execution unit that is held up due to a different dependency. If
out-of-order execution were not used, often many of the execution units
would be unnecessarily idle. Out-of-order execution results in the
execution units doing useful work most of the time.

However, even in processors with loosely coupled execution units
capable of out-of-order execution (such as the '126 processor) there is a
limit on the number of operations permitted to be outstanding at any time.
Thus, given the significant frequency of operations depending (directly or
indirectly) on the value of operands read from memory, there is a limit on
the degree to which the execution units in even such processors can be kept
busy.

SUMMARY OF THE INVENTION 

The inventors have observed that the operand (operands) stored
in memory required for the execution of a program instruction frequently
does (do) not change between repeated executions of the instruction, and
have devised a technique to take advantage of this phenomenon in order to
improve processor performance.

In particular, the invention provides a structure for, and a
method of operating, an operand cache to store operands retrieved from a
memory. An instruction requiring an operand stored in the memory is
allowed to speculatively execute in an execution unit of a processor using
an operand stored in an operand cache entry corresponding to the address of
the instruction. A primary advantage of this approach is the (speculative)
removal of interlocks from operations that are waiting on the value of the
operand.

When the actual operand is later retrieved from the memory it is
compared to the cached operand used for speculative execution. If the
cached and actual operands are unequal then the actual operand overwrites
the cached operand in the operand cache, the speculatively executed
instruction and all subsequent instructions are aborted, and the processor
resumes execution at the address of the instruction that was speculatively
executed.

In recognition of the significant cost in processor performance
associated with abortion of a speculatively executed instruction (and
subsequent instructions) that uses an incorrect cached operand, the
inventors have devised a technique for increasing the coherency between the
operand cache and the memory. In particular, each entry of the operand
cache may include an operand address corresponding to the operand stored in
the entry. After the address of the actual operand for a speculatively
executed instruction is determined, it is compared against the operand
address stored in the operand cache entry corresponding to the speculatively
executed instruction. The actual operand address overwrites the cached
operand address if different. Whenever a write of a particular value to a
particular address in the memory occurs, the particular value overwrites the
operand stored in each entry of the operand cache whose operand address is
equal to the particular address.
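
By way of illustration only, the two coherency actions just described (the
operand address check and the update of cached operands on a memory write)
can be sketched in software. The names Entry, check_address and snoop_write
are hypothetical and do not appear in the specification; the hardware
performs these comparisons in parallel rather than in a loop.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    ia: int  # address of the read instruction
    oa: int  # cached operand address
    od: int  # cached operand value


def check_address(entry, actual_oa):
    """Compare the calculated operand address with the cached one.

    The actual address overwrites the cached address if they differ.
    Returns True when the cached address was already correct.
    """
    matched = entry.oa == actual_oa
    if not matched:
        entry.oa = actual_oa
    return matched


def snoop_write(entries, address, value):
    """On a memory write, update every entry whose operand address matches."""
    for entry in entries:
        if entry.oa == address:
            entry.od = value
```

In this sketch a write to address 0x2000 refreshes only those entries whose
oa field equals 0x2000, so a later speculative execution using such an entry
sees the newly written value.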

Another technique for reducing the frequency (and the associated
overhead) of speculative executions using incorrect operands entails
estimating the likelihood of the corresponding cached operand being correct.
In a specific embodiment, this is achieved by storing a count in each entry
of the operand cache, and permitting speculative execution using a cached
operand only if the corresponding count exceeds a predetermined threshold.
The cached operand is compared with the actual operand retrieved from the
memory (even if speculative execution using the cached operand was not
permitted on account of an insufficient count) and the corresponding count
is incremented (decremented) if the two operands are equal (unequal).
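
The count mechanism can be sketched as follows, purely as an illustration.
The 3-bit width matches the COUNT field described in the detailed
description; the saturating behavior at 0 and 7 and the threshold value of 3
are assumptions made for this sketch, and the function names are
hypothetical.

```python
COUNT_MAX = 7  # assumption: 3-bit counter that saturates instead of wrapping
THRESHOLD = 3  # hypothetical threshold; the specification fixes no value


def may_speculate(count):
    """Speculative execution is permitted only if the count exceeds the threshold."""
    return count > THRESHOLD


def update_count(count, cached_operand, actual_operand):
    """Increment on a match and decrement on a mismatch.

    The update is applied whether or not speculation was permitted.
    """
    if cached_operand == actual_operand:
        return min(count + 1, COUNT_MAX)
    return max(count - 1, 0)
```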

For a fuller understanding of the nature and advantages of the
invention, reference should be made to the ensuing detailed description
taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a processor containing an operand
cache in accordance with the invention.

Fig. 2 illustrates the format of each entry of an operand cache.

Fig. 3 illustrates a decoder containing an operand cache.

Fig. 4 illustrates the comparison performed within the address
processor between calculated and cached operand addresses.

Fig. 5 illustrates the comparison performed within the execution
unit between an actual operand and a cached operand.

Fig. 6 is a block diagram of a personal computer incorporating a
processor that contains an operand cache in accordance with an embodiment of
the present invention.

Fig. 7 is a block diagram of a networked server computer
incorporating a processor that contains an operand cache in accordance with
an embodiment of the present invention.

Fig. 8 is a block diagram of a multimedia computer incorporating
a processor that contains an operand cache in accordance with an embodiment
of the present invention.

DETAILED DESCRIPTION OF A SPECIFIC EMBODIMENT

Processor Overview

Fig. 1 depicts a processor 100 containing an operand cache (OC)
104 in accordance with the invention. Processor 100 is described as a
pipelined system whose various stages operate independently to a large
degree (and is similar in most respects to the processor taught in the '126
patent, incorporated by reference, above). However, it will be apparent to
one of ordinary skill in the art that the present invention is applicable to
other processor architectures such as traditional lock-step pipelines and
unpipelined systems.

Processor 100 includes a decoder unit (DEC) 101, an address
preparation unit (AP) 102, an execution unit (EU) 103, possibly other
execution units (not shown) and a memory/cache subsystem (MCS) 105.
Hereinafter, the term "execution unit(s)" used without a reference numeral
also encompasses AP 102 and is used to describe entities that execute (carry
out) the data or address manipulations required by the operations.
Generally, these units are capable of superscalar pipelined speculative and
out-of-order execution. Each execution unit provides termination status
information, regarding each operation it has executed, to DEC 101. In
particular, AP 102 (EU 103) sends termination status information to DEC 101
via a termination status bus 115 (116).

AP 102 contains a relabeled copy of the general purpose
registers and segment registers and has the hardware resources for
performing segmentation and paging of virtual memory addresses. A duplicate
copy of the general purpose registers is also maintained in EU 103. The
segment and descriptor virtual register file is entirely and solely within
the address unit. AP 102 calculates addresses for memory operand reads and
writes, control transfers (including branches, calls, and protected-mode
gates), and sequential instruction execution across page boundaries (page
crosses).

In one embodiment EU 103 is an integer execution unit
responsible for executing integer arithmetic operations. As mentioned
above, EU 103 also contains a relabeled virtual copy of the general purpose
registers (kept coherent with the copy of AP 102), and has the hardware
resources for performing integer arithmetic and logical operations.
Duplicate general register files ensure minimal [...]
and the logic that manipulates their values. It also permits read and write
accesses based on the needs of each execution unit, which would otherwise
require an increase in the number of read and write ports.

DEC 101 specifies an instruction octet to be retrieved
(sometimes referred to as "prefetching") by initially passing the initial
address of a sequence of instruction octets to MCS 105 via an address bus
108 and then requesting additional octets from memory via control signals
(not shown). If MCS 105 contains a level one (i.e. on-chip) cache, it first
attempts to retrieve the specified instruction octet from the level one
cache. If the level one cache does not store the specified octet or there
is no level one cache, MCS 105 then attempts to retrieve the specified octet
from a level two (i.e. off-chip) cache 106. If level two cache 106 does not
store the specified octet, then MCS 105 retrieves the specified octet from a
main memory 109. (The combination of cache 106, memory 109 and the level
one cache, if existing, is referred to hereinafter as the memory.) In any
event, the retrieved instruction octet is passed by MCS 105 to DEC 101 via
an instruction fetch bus 107.

DEC 101 decodes instructions from the octets retrieved by MCS
105 and translates each of the instructions into one or more simpler
operations that the various execution units (i.e. AP 102, EU 103 and
possibly other units not shown) understand. These operations are sometimes
referred to as pseudo-operations or p-ops. DEC 101 retains information
about the translation process for later action based on termination
information received from the various execution units, discussed below. In
particular, DEC 101 includes a history RAM 120 for storing the address of
the instruction corresponding to each issued operation. As will be
discussed below, upon abortion of an instruction due to an operation that
executes with an incorrect operand from OC 104, instruction retrieval will
commence with the instruction address stored in the entry of history RAM 120
corresponding to the erroneously executed operation.

DEC 101 also issues (transmits) the operations to the various
execution units via an operations bus 110. A plurality of the control
fields of an operation are operation commands unique to a specific execution
unit. Other operation fields (such as source and destination specifiers)
are shared among multiple execution units. Each operation is a collection
or "packet" of all unique and shared control command fields. The fields are
either largely unencoded or recoded into formats that greatly facilitate or
eliminate any further decoding by the execution units. Every operation
packet is issued to all units simultaneously.

Each instruction will result in one or more operations being
issued. DEC 101 time-stamps each operation it generates. These time-stamps
are called tags. The tagged operations are issued to the execution units
via operations bus 110 and each unit autonomously executes the operations
for which it is responsible. Whether or not a particular execution unit
will execute any given operation varies according to whether the original
instruction required the specialized function associated with the execution
unit. Each operation issued by DEC 101 is given a tag which uniquely
identifies each operation currently outstanding in the machine. Tags are
issued in increasing order, allowing easy determination of relative age of
any two outstanding tags. Up to 14 operations are allowed to be
outstanding. Each tag is a 5-bit quantity, thereby permitting a two's
complement signed comparison to indicate the relative age of two tags.
However, at any point in time, only the four least significant bits of a tag
are required to unambiguously identify an operation. (In other embodiments,
more or fewer operations could be permitted to remain outstanding, thereby
perhaps changing the number of bits required for each tag.) Upon the
issuance of an operation, DEC 101 stores the address of the associated
instruction in the entry of history RAM 120 indexed by the four least
significant bits of the operation's tag. When an abort occurs rolling back
the processor state to just before the operation with tag i, then DEC 101
issues new operations from this point with tags starting at i, in order to
ensure the reliability of relative age comparison.
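
One plausible reading of the tag scheme is sketched below for illustration.
It assumes that tags wrap modulo 32 and that relative age is taken from the
sign of the 5-bit two's complement difference of two tags, which is
unambiguous because at most 14 operations are outstanding. The function
names are hypothetical.

```python
TAG_BITS = 5
TAG_MASK = (1 << TAG_BITS) - 1  # 0x1F
SIGN_BIT = 1 << (TAG_BITS - 1)  # 0x10


def signed_diff(tag_a, tag_b):
    """5-bit two's complement difference tag_a - tag_b, in the range -16..15."""
    diff = (tag_a - tag_b) & TAG_MASK
    return diff - (1 << TAG_BITS) if diff & SIGN_BIT else diff


def is_older(tag_a, tag_b):
    """True if the operation tagged tag_a was issued before the one tagged tag_b."""
    return signed_diff(tag_a, tag_b) < 0


def history_index(tag):
    """History RAM entry selected by the four least significant bits of a tag."""
    return tag & 0xF
```

For example, under these assumptions tag 30 is older than tag 2, because the
sequence wraps from 31 to 0 and the signed difference 30 - 2 evaluates to -4
in 5-bit arithmetic.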

Bus transactions between functional units (i.e. DEC 101, AP 102,
EU 103 and MCS 105) include the tag of the originating operation.
Functional units pair up operations, addresses, and operands with these
tags. The execution units provide terminations used for pipeline control to
DEC 101. The use of terminations is discussed further below in the section
on speculative execution. DEC 101 is responsible for tracking operations
from the time of issue through the time of retirement. If all the
operations associated with an instruction are processed by the required
execution units without any errors detected (all normal terminations), then
all the operations for the instruction are simultaneously retired. If any
of the execution units detect any faults on any of the operations, then all
of the operations associated with the instruction and all "younger" p-ops
are aborted.

An internal operation is considered "outstanding" from the time
it is issued up until the time it is retired or aborted. It is considered
"fully terminated" as soon as it is terminated by all required functional
units. Since operations are retired or aborted in groups (normally
associated with an instruction) an operation may be fully and normally
terminated but still remain outstanding.

The execution units receive memory and I/O requests including
returned data and instructions via MCS 105, discussed below. MCS 105
includes write reservation queues for write operations and read after write
short-circuit paths. The write reservation queues permit the execution
units to post their results for an operation to MCS 105 and then proceed
without waiting to execute their operation. The write reservation queues
independently accept physical addresses and result data, which are generally
independently generated out-of-order by different units. All addresses and
data are sent to MCS 105 with the tag of the operation associated with the
address or data; the write reservation queues then use the tags to
correctly associate the addresses with the data. The write reservation
queues also use the tags to enforce the original sequential program order
for all stores to memory.

Stores to memory are considered irreversible, and hence are only
performed when it is completely safe to do so. MCS 105 waits to do the
store until the tag associated with the store is older than the oldest
outstanding operation, as indicated by the DEC on a tag status bus 111.
Reads to locations where an "older" write is pending in the write
reservation queues get their data from the write reservation queue instead
of from memory. This is the read after write short-circuit mentioned above.
More details on the utilization of write reservation queues are taught in the
'126 patent, incorporated by reference, above.
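
The behavior of a write reservation queue can be modeled in a few lines,
purely as an illustrative sketch. The class and method names are
hypothetical, tags are modeled as plain increasing integers (wraparound of
the 5-bit tag is ignored), and memory is modeled as a dictionary.

```python
class WriteReservationQueue:
    """Toy model: pairs addresses and data by tag and commits stores in order."""

    def __init__(self):
        self.pending = {}  # tag -> {"addr": ..., "data": ...}

    def post_address(self, tag, addr):
        self.pending.setdefault(tag, {})["addr"] = addr

    def post_data(self, tag, data):
        self.pending.setdefault(tag, {})["data"] = data

    def read(self, tag, addr, memory):
        """Read for operation `tag`; an older pending write to addr wins over memory."""
        older = [t for t, w in self.pending.items()
                 if t < tag and w.get("addr") == addr and "data" in w]
        if older:
            return self.pending[max(older)]["data"]  # youngest of the older writes
        return memory.get(addr, 0)

    def drain(self, oldest_outstanding, memory):
        """Commit, in tag order, stores older than the oldest outstanding operation."""
        for tag in sorted(self.pending):
            write = self.pending[tag]
            if tag >= oldest_outstanding or "addr" not in write or "data" not in write:
                break
            memory[write["addr"]] = write["data"]
            del self.pending[tag]
```

The address and the data for a given store may arrive in either order; the
tag is what ties them together, and a store reaches memory only once every
operation up to and including it has been retired.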

Each execution unit has its own queue into which incoming
operations are placed pending execution and is free to execute its
operations largely independent of the other execution units. Each execution
unit only executes those operations that require processing by that unit.
It is possible for one unit to finish an operation associated with a
"younger" instruction while another unit is still executing an operation
corresponding to an older instruction. Thus, instructions can be executed
in other than their original program order. Such out-of-order execution
tends to localize the effects of dependencies to a single execution unit.
Thus a dependency in AP need not stall the EU, and vice versa. Because of
their loose coupling and independent execution, stalls that affect only one
execution unit can be effectively absorbed when that unit is later able to
proceed past another execution unit that is held up due to a different
dependency. If out-of-order execution were not used, often many of the
execution units would be unnecessarily idle. Out-of-order execution results
in the execution units doing useful work most of the time.

While the execution units are busy, DEC 101 continues to fetch
and decode instructions, and issue operations, in a manner largely decoupled
from the activity of the execution units. Upon completing execution of each
operation, the execution units send termination signals to DEC 101 (in
particular, AP 102 and EU 103 use termination busses 115 and 116,
respectively, for this purpose). These terminations are associated with the
operation's tag. DEC 101 keeps track of all terminations corresponding to
each tagged operation. Since the tags are like time-stamps, the relative
age of operations is discernible from the tags. Issued operations that have
yet to be retired or aborted are considered outstanding. If an operation is
normally terminated by all units involved, DEC 101 will "retire" the
operation, unless an older operation is still outstanding.

If an operation is abnormally terminated, DEC 101 aborts all
operations equal to and younger than the operation that was abnormally
terminated, i.e. informs all execution units (and MCS 105) to stop executing
operations having the tag of the operation or a tag of a younger (i.e. later
issued) operation by setting an abort bit of tag status bus 111 and placing
the tag of the abnormally terminated operation on tag status bus 111. (When
the abort bit of tag status bus 111 is cleared, the tag number carried on bus
111 indicates the tag of the oldest outstanding operation.) Instruction
aborts cause the processor state to revert to that associated with some
previously executed instruction. Multiple operations may be retired or
aborted simultaneously. As will be discussed below, an operation may be
aborted because it executed using an operand from operand cache 104 later
determined to be incorrect.

The execution units maintain archival versions of those portions
of the processor state that have changed in association with the execution
of the outstanding operations. When operations are retired, the changes
made by the retiring operations are made irreversible. In contrast, when
operations are aborted, the processor state reverts to that existing prior
to the execution of the aborted operations. Again, the tags are used to
define the precise time to which the processor state is reverted. (Tags are
also used in MCS 105 to enforce the order in which the program expects data
to be stored, as discussed above.)

Register relabeling (also known as register reassignment) is one
possible technique to enable processor 100 to perform speculative execution.
It is a method by which processor 100 maintains the archival versions of the
general register file, the segment registers, and the associated "hidden"
descriptor registers. The number of registers within each of AP 102 and EU
103 ("physical registers") is larger than the number of registers that can
be specified by an instruction ("virtual registers"). Specifically, AP 102
and EU 103 have as many additional registers as there are possible
outstanding operations that can change the registers. Unlike conventional
register files, the register files of AP 102 and EU 103 do not have
permanently assigned register names. Instead, the register name (i.e.
virtual register associated with) of each physical register varies with time
and the requirements of the program. Further, multiple physical registers
can correspond to the same virtual register, each of the multiple registers
being associated with the execution of different outstanding operations.

The register relabeling is transparently managed by DEC 101. At
the time each operation is issued, DEC 101 "relabels" (or reassigns) the
(virtual) register specifiers used by the instructions into new and
different (physical) register specifiers that are part of each operation.
DEC 101 is responsible for tracking the tagged operations, maintaining a
pool of available physical registers (the free list), maintaining the
mapping between the physical and virtual register specifiers for each
operation, and controlling whether to retire or abort operations, as
discussed above. Register relabeling reduces the performance degradation
associated with the aborting of speculatively executed operations. Aborts
cause the state of processor 100 to revert to that associated with some
previously executed operation. Aborts are largely transparent to the
execution units, as most processor state reversion is managed through the
dynamic register relabeling specified by DEC 101 in subsequently issued
operations. The '126 patent, incorporated by reference above, describes a
technique of register reassignment in more detail.
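
A much simplified software model of register relabeling is given below for
illustration. It is not the mechanism of the '126 patent. The class and
method names are hypothetical, tags are modeled as plain increasing
integers, and each operation is assumed to write exactly one destination
register.

```python
class Relabeler:
    """Toy model of the virtual-to-physical register mapping kept by a decoder."""

    def __init__(self, n_virtual, n_physical):
        self.mapping = list(range(n_virtual))            # virtual -> physical
        self.free = list(range(n_virtual, n_physical))   # the free list
        self.log = []  # (tag, virtual, old_physical, new_physical), in issue order

    def issue(self, tag, dest_virtual):
        """Assign a fresh physical register to the destination of an operation."""
        new = self.free.pop(0)
        old = self.mapping[dest_virtual]
        self.log.append((tag, dest_virtual, old, new))
        self.mapping[dest_virtual] = new
        return new

    def retire_through(self, tag):
        """Retire operations up to and including tag; archival copies are freed."""
        while self.log and self.log[0][0] <= tag:
            _, _, old, _ = self.log.pop(0)
            self.free.append(old)

    def abort_from(self, tag):
        """Abort tag and all younger operations, restoring the earlier mapping."""
        while self.log and self.log[-1][0] >= tag:
            _, virtual, old, new = self.log.pop()
            self.mapping[virtual] = old
            self.free.append(new)
```

In this model an abort never copies register contents. It only restores the
mapping, which is consistent with the observation above that aborts are
largely transparent to the execution units.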


Operand Cache 

For the purposes of illustrating the invention, the discussion
will focus on an operation that is to be executed in EU 103 and that
requires an operand that is stored in the memory (hereinafter referred to as
"the read operation", the associated instruction hereinafter referred to as
"the read instruction"). DEC 101 places the read operation (as well as any
other operations associated with the decoded read instruction) onto
operation bus 110. AP 102 reads the read operation from operation bus 110
and calculates the address of the operand associated with the operation.
Then AP 102 passes this address on to MCS 105 via address bus 108 in order to
retrieve the required operand from the memory. MCS 105 retrieves the
required operand from the memory and places it on a memory and I/O read data
bus 113.

After decoding the read instruction into the read operation (and
possibly other operations) and before placing the read operation onto
operation bus 110, DEC 101 attempts to retrieve an operand and an operand
address for the read instruction from OC 104. The success of this attempt
depends on the fulfillment of three conditions, discussed below. If the
attempt is successful, DEC 101 passes the operand retrieved from OC 104
along an operand datum (OD) bus 114 to EU 103 where it is stored in a memory
data file (MDF) 121 until EU 103 is ready to execute the read operation, as
discussed in more detail below. In addition, DEC 101 passes the operand
address retrieved from OC 104 along an operand address (OA) bus 114 to AP
102, where it will later be compared with the actual operand address
calculated by AP 102, as discussed below in more detail.

By providing an operand from OC 104, execution of the read
operation can proceed immediately on a speculative basis, i.e. it becomes
unnecessary for EU 103 to wait for AP 102 to calculate the operand address
and for MCS 105 to retrieve the operand stored at this address. (It is
noted, however, that these two tasks are still required in order to verify
the correctness of the cached operand, as described below. In other
embodiments, consistency between the memory and operand cache 104 might be
always maintained, thereby obviating the need to verify the correctness of
the cached operand.) Thus, the interlock on the read operation can be
speculatively removed. In addition, due to this earlier execution of the
read operation, the interlocks on other operations waiting on the results of
the read operation may be removed earlier than they would otherwise be if
the read operation did not execute until the actual operand was returned
from MCS 105.

Without operand cache 104, the delay in executing the read
operation incurred while waiting for AP 102 and MCS 105 to perform the tasks
described immediately above would also delay operations, subsequently placed
on operations bus 110 by DEC 101, that depend on the result of the execution
of the read operation. Given that only a fixed number of operations are
permitted to remain outstanding within processor 100 at any point in time,
the delay (associated with operand address calculation and subsequent
operand retrieval from memory) in executing read instructions in a processor
similar to processor 100 but without an operand cache would, on average,
result in more idle cycles. An idle cycle is a cycle during which one or
more of the execution units are unused (i.e. don't have operations in their
respective queue that they can execute during the cycle), and, thus,
represents suboptimal use of the processor.

Fig. 2 depicts the format of each entry of OC 104. Each entry consists of a
32-bit instruction address (IA) field, an operand address (OA) field for
storing a 32-bit byte address, a 32-bit operand datum (OD) field for storing
up to a double word, a 1-bit Valid IA (VIA) field, a 1-bit Valid OA (VOA)
field, a 1-bit Valid OD (VOD) field, and a 3-bit COUNT field. The IA field
holds the address of an instruction which makes a memory read access (i.e. a
"read instruction" requiring an operand stored in memory). The OD field
holds a predicted value for the operand associated with the instruction
(this value hereinafter referred to as a cached operand). The OA field holds
an operand address corresponding to the cached operand. The two least
significant bits (LSBs) of the OA field and the size of the operand
determine the particular bytes of the OD field used to store the cached
operand. For example, a 2-byte operand is stored in the third and fourth
bytes of the OD field if the 2 LSBs of the OA field are 10. The OD field in
the described embodiment does not store an operand that straddles a
double-word boundary. (In other embodiments, the size of the OD field could
be different. For example, if the OD field holds 8 bytes, then the
particular bytes of the OD field in which a cached operand is stored are
determined by the 3 LSBs of the operand address stored in the OA field and
the size of the operand.) The VIA, VOA and VOD fields store 1-bit valid bits
for the IA, OA and OD fields, respectively.
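
The byte-selection rule just described can be illustrated with a short
sketch. It is only an illustration under stated assumptions: the function
name od_byte_positions and the od_bytes parameter are hypothetical (they do
not appear in the described embodiment), byte positions are counted from
zero, and the OD field size is assumed to be a power of two.

```python
def od_byte_positions(operand_address: int, size: int, od_bytes: int = 4) -> list[int]:
    """Return the 0-based byte positions of the OD field that hold the operand.

    od_bytes is assumed to be a power of two (4 in the described embodiment,
    8 in the alternative mentioned in the text).
    """
    offset = operand_address & (od_bytes - 1)  # the 2 LSBs when od_bytes == 4
    if offset + size > od_bytes:
        # The described embodiment does not cache operands that straddle
        # an OD-field (double-word) boundary.
        raise ValueError("operand straddles an OD-field boundary")
    return list(range(offset, offset + size))


# A 2-byte operand whose address ends in binary 10 occupies the third and
# fourth bytes, i.e. positions 2 and 3 when counting from 0.
print(od_byte_positions(0b1010, 2))
```

With od_bytes=8 the same function uses the 3 LSBs of the address, which
corresponds to the 8-byte variant mentioned in the parenthetical remark.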

The COUNT field stores a measure of the likelihood that the corresponding OD
field stores the correct operand value (i.e. is equal to the operand in the
memory specified by the read operation). In another embodiment, the COUNT
field could be replaced by a 5-bit history bit pattern (HBP) field that
contains a respective bit indicating the result of each of the last five
comparisons (1 indicating equality) between the operand cached in the entry
and the actual operand retrieved by MCS 105. The indicated sizes for the
COUNT or history bit pattern fields are clearly not critical, and could be
varied in alternative embodiments.

The part of OC 104 that stores the IA fields is a fully associative Content
Addressable Memory (CAM) structure. Thus, DEC 101 passes the address of a
read instruction to OC 104, which instruction address is compared to the IA
field of each entry of OC 104. (It will be obvious to one of ordinary skill
in the art that in other embodiments, the part of OC 104 that stores the IA
fields could either be a direct-mapped structure or a set-associative
structure, both of which would be indexed by a part of a specified
instruction address. In such embodiments, the IA field would only have to
store the part of the instruction address not used for indexing.) The part
of OC 104 that stores the OA fields is also a fully associative CAM
structure and is used to maintain coherence between operands stored in the
memory and operands cached in OC 104, as described below.

Fig. 3 illustrates the operation of DEC 101 in more detail. A program
counter 301 determines an address (prefetch address) of an instruction to be
retrieved from memory. The prefetch address is passed from program counter
301 to instruction fetch and decode logic 302 (which initiates the retrieval
of the instruction octet from memory and carries out the decoding of the
retrieved instruction into one or more operations to be carried out by AP
102 and EU 103) and to OC 104. Operation of OC 104 is determined by the
following three conditions:

1) The IA field of an entry of OC 104 matches the prefetch instruction
address.

2) Each of the VIA, VOA and VOD bits in the matching entry is set.

3) The value in the COUNT field of the matching entry exceeds a
predetermined threshold. (In the alternative embodiment described above
where the COUNT field is replaced by a history bit pattern, the third
condition is satisfied if the prediction bit in the entry of a 32-entry
history ROM (not shown) indexed by the 5-bit history bit pattern of the
matching entry is set. As an alternative to the history ROM, the third
condition could be satisfied if the prediction bit output of a combinatorial
circuit (not shown) is set when the circuit's 5-bit input is the history bit
pattern of the matching entry. Techniques for determining the appropriate
prediction bit for a given history bit pattern are well known in the art.)

A control logic 303 contains a comparator (not shown) for comparing the
COUNT field of the matching entry with the predetermined threshold (or
contains the above-described history ROM or combinatorial circuit, in the
history bit pattern embodiment). In addition, control logic 303 contains a
16-entry address table 304 for storing the prefetch addresses associated
with the outstanding operations. In particular, the current prefetch address
is stored in the table entry indexed by the four least significant bits of
the tag of the read operation associated with the prefetched instruction.
The use of table 304 is described further below.

If all three of the above conditions are satisfied, then DEC 101 retrieves
an operand and corresponding operand address from OC 104, passes the cached
operand to EU 103 on operand datum (OD) bus 114, passes the cached operand
address to AP 102 on operand address (OA) bus 112, and control logic 303
asserts a "HIT" signal on line 110A (which is part of operation bus 110).
The "HIT" signal indicates to EU 103 that the operand passed on OD bus 114
can be used to (speculatively) execute the read operation associated with
the prefetched instruction (also passed to EU 103 by DEC 101 on operation
bus 110). The particular bytes of the OD field that are retrieved are
determined by the 2 LSBs of the OA field and by the size of the operand
required by the read operation.

If only the first two of the above three conditions are satisfied, then DEC
101 retrieves an operand and corresponding operand address from the matching
entry of OC 104, passes the cached operand to EU 103 on OD bus 114, passes
the cached operand address to AP 102 on OA bus 112, and control logic 303
asserts a "THIT" signal on line 110B (which is part of operation bus 110),
which indicates to EU 103 that the operand passed on OD bus 114 is not to be
used for speculative execution of the read operation.
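
The way the three conditions select between the "HIT" and "THIT" signals can
be sketched in software as follows. This is a behavioral illustration only,
not a description of the hardware: the names OCEntry and classify_lookup and
the "MISS" label are assumptions introduced here, and the linear loop merely
stands in for the fully associative CAM comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OCEntry:
    ia: int      # instruction address of the read instruction
    oa: int      # cached operand address
    od: int      # cached operand datum
    via: bool    # Valid IA
    voa: bool    # Valid OA
    vod: bool    # Valid OD
    count: int   # confidence counter (3 bits in the described embodiment)


def classify_lookup(entries: list[OCEntry], prefetch_address: int,
                    threshold: int) -> tuple[str, Optional[OCEntry]]:
    """Return ("HIT" | "THIT" | "MISS", matching entry or None)."""
    for entry in entries:
        # Condition 1: IA match (an entry with VIA cleared never matches here).
        if not (entry.via and entry.ia == prefetch_address):
            continue
        # Condition 2: all valid bits set (VIA was already checked above).
        if not (entry.voa and entry.vod):
            return "MISS", entry
        # Condition 3: COUNT exceeds the predetermined threshold.
        return ("HIT" if entry.count > threshold else "THIT"), entry
    return "MISS", None
```

In both the "HIT" and "THIT" cases the cached operand and cached operand
address would still be forwarded, so that the later comparisons can update
the COUNT field; only "HIT" permits speculative use of the operand.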

The purpose of the third condition involving the COUNT field (or
alternatively a history bit pattern field) is to avoid the overhead
(described below) incurred when a read instruction (and all subsequently
decoded instructions) must be aborted in the event that a read operation
corresponding to the read instruction executes using an incorrect operand
value from OC 104. The COUNT field of an entry of OC 104 provides some
indication of the likelihood of the correctness of the value stored in the
OD field of the entry by keeping track of the accuracy of the operand cached
in the OD field during previous executions of the same read instruction.
(In other embodiments, values other than a count or a history bit pattern
could be used to provide such an indication.) The reason a cached operand
value and a cached operand address are provided to EU 103 and AP 102,
respectively, even if only the first two conditions above are satisfied is
to permit the cached operand to be compared against the memory operand for
the purpose of updating the COUNT (or HBP) field, as described below.

After DEC 101 has decoded the prefetched instruction, placed the associated
operations on operation bus 110 and (possibly) placed a cached operand and a
cached operand address on busses 114 and 112, respectively, DEC 101 passes
the next prefetch address to MCS 105 and repeats the above process for the
next instruction to be executed.

Writes to MCS 105 could result in inconsistency between a cached operand in
OC 104 and the corresponding operand stored in the memory. If the entry
storing the cached (but outdated, due to a write to memory) operand meets
the above three conditions, then EU 103 might speculatively execute using
the outdated cached operand. In order to reduce the frequency of
speculatively but erroneously executed instructions, processor 100 employs
the following coherence scheme to keep OC 104 more current:

When processor 100 speculatively performs a write to memory, the (32-bit)
write address placed on address bus 108 by AP 102 is compared against the OA
field of every entry of OC 104. The value to be written (which will have
been placed on a memory and I/O write data bus 117 by either AP 102 or EU
103) is used to update the OD field of each entry of OC 104 whose OA field
matches the write address. (Alternatively, the updating of OC 104 could be
postponed until the instruction requiring the write to memory has been
successfully retired.)

When a device other than processor 100 performs a write to memory, the write
address placed on address bus 108 by MCS 105 is compared against the OA
field of every entry of OC 104. The value to be written (which will have
been placed on memory and I/O write data bus 117 by MCS 105) is used to
update the OD field of each entry of OC 104 whose OA field matches the write
address.

A write address will match the OA field of an entry of OC 104 when the 30
MSBs of each are the same. The number of bytes to be written and the 2 LSBs
of the write address will determine the particular bytes of the OD field of
a matching entry of OC 104 that are updated. For example, if the 2 LSBs of
the write address for a two-byte transfer are 10, then the third and fourth
bytes of the OD field will be updated. If a particular entry caches only the
second of the two bytes that are written, then the 2 LSBs of its OA field
will be 11. If all 32 bits (instead of the 30 MSBs) of the OA field and the
write address were compared to determine matching entries of OC 104, then
the particular entry described immediately above would not be modified and
the one-byte operand cached therein in the fourth byte of the OD field would
no longer reflect the value of the operand stored in memory, thereby
resulting in a greater chance of having to abort a speculatively executed
(with an incorrect cached operand) instruction. (In other embodiments, the
size of the OD field could be different. For example, if the OD field holds
8 bytes, then the particular bytes of the OD field of each matching entry
that would be updated would be determined by the 3 LSBs of the write address
and the number of bytes written. A matching entry would be one whose OA
field has the same 29 MSBs as the write address.)
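
The 30-MSB matching rule and the resulting byte updates can be sketched as
follows. This is a minimal software illustration, not the hardware
mechanism: the function name snoop_write and the dictionary layout of an
entry are hypothetical, and valid-bit handling is omitted for brevity.

```python
def snoop_write(entries: list[dict], write_address: int, data: bytes) -> None:
    """Update cached operands affected by a write of `data` at `write_address`.

    Each entry is a dict with an integer "oa" (operand address) and a 4-byte
    bytearray "od" (operand datum). Both keys are illustrative assumptions.
    """
    for i, value in enumerate(data):
        byte_address = write_address + i
        for entry in entries:
            # Compare only the 30 MSBs, so an entry caching any byte of the
            # written double word is found regardless of its own 2 LSBs.
            if entry["oa"] >> 2 == byte_address >> 2:
                entry["od"][byte_address & 3] = value


# Entry caching a single byte at an address whose 2 LSBs are 11.
entry = {"oa": 0x1003, "od": bytearray(4)}
snoop_write([entry], 0x1002, b"\xAA\xBB")  # 2-byte write, 2 LSBs are 10
print(entry["od"].hex())  # the third and fourth bytes have been updated
```

Had the full 32-bit addresses been compared instead, the write to 0x1002
would not have matched the entry at 0x1003 and its cached byte would have
gone stale, which is the situation the text warns against.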

As follows from the above discussion, the purpose of having an OA field in
OC 104 is to help maintain the consistency between operands cached in OC 104
and corresponding operands stored in the memory. In other embodiments, the
OA field (and thus the VOA field) could be eliminated at the cost of reduced
performance caused by a greater frequency of speculatively executed
instructions that have to be aborted (due to a greater frequency of outdated
operands cached in OC 104). The conditions, in such embodiments, for
providing an operand from an entry of OC 104 to EU 103 would be 1) the IA
field matches the address of the read instruction AND 2) the VIA and VOD
bits are set. In addition, the COUNT field would have to exceed the
predetermined threshold in order for EU 103 to be permitted to speculatively
execute with the cached operand. The COUNT field would be incremented
(decremented) upon determining that the cached and actual operands are equal
(unequal).

Fig. 4 illustrates the comparison performed within AP 102 between a cached
operand address passed to AP 102 and the correct operand address computed by
AP 102. When a cached operand address is passed to AP 102 on OA bus 112, as
described above, the cached operand address is stored in an address queue
(AQ) 401 (in an entry of AQ 401 indexed by the four least significant bits
of the tag of the associated read operation). Later, when AP 102 has
calculated the operand address associated with a read operation having a
particular tag, a comparator 402 compares the calculated operand address
with the cached operand address stored in AQ 401 corresponding to the same
tag. The result of the comparison is output on a line 115A which is part of
termination status bus 115 (Fig. 1) on which AP 102 reports the status of
its address calculation to DEC 101.

If the termination status placed by AP 102 on bus 115 indicates the
occurrence of a fault during calculation of the operand address (e.g.
illegal address), then DEC 101 clears the VIA, VOA, and VOD bits of the
entry in OC 104 for the read instruction when the read instruction is
retired (i.e. all previously retrieved instructions have successfully
executed). At this time, DEC 101 aborts the read instruction (and
subsequently decoded instructions), i.e. informs all execution units (and
MCS 105) to stop executing operations having the tag of the oldest operation
issued as a result of decoding the read instruction (not necessarily the
operation that speculatively and erroneously executed) or a tag of a younger
(i.e. later issued) operation. This is preferably done by setting the abort
bit of tag status bus 111 and placing the tag of the oldest operation issued
as a result of decoding the read instruction on tag status bus 111. In
addition, DEC 101 sends an operation over operation bus 110 requesting AP
102 to calculate the address of the appropriate fault handler. After
performing the required calculation, AP 102 sets a program counter to the
calculated address so that execution of the fault handler may proceed.

On the other hand, if the termination status placed by AP 102 on 
bus 115 indicates only that the calculated operand address and the cached 
operand address (passed to AP 102 on OA bus 112) are unequal, then DEC 101 
clears the VOD bit and updates the OA field with the correct operand address
(that was calculated and placed on address bus 108 by AP 102).
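
The two responses of DEC 101 to the AP termination status can be summarized
in a short sketch. It models only the changes to the OC entry; the function
name apply_ap_status and the dictionary keys are hypothetical, and the
timing detail that the fault case takes effect when the read instruction is
retired is deliberately left out.

```python
def apply_ap_status(entry: dict, fault: bool, addresses_equal: bool,
                    calculated_address: int) -> None:
    """Update an OC entry according to the AP termination status.

    `entry` is a dict with keys "oa", "via", "voa" and "vod" (illustrative).
    """
    if fault:
        # Fault during address calculation: invalidate the whole entry.
        entry["via"] = entry["voa"] = entry["vod"] = False
    elif not addresses_equal:
        # Cached address was wrong: the cached operand can no longer be
        # trusted, and the OA field is corrected to the calculated address.
        entry["vod"] = False
        entry["oa"] = calculated_address
    # If the addresses are equal and no fault occurred, nothing changes here.
```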
Fig. 5 illustrates the comparison performed within EU 103 between an operand
cached in OC 104 (and passed to EU 103 via OD bus 114) and the actual
operand stored at the operand address computed by AP 102 and returned by MCS
105 to EU 103 via memory and I/O read data bus 113. When passed to EU 103, a
cached operand is stored in an entry (indexed by the four least significant
bits of the tag of the associated read operation) of memory data file (MDF)
121, where the cached operand remains until EU 103 is ready to use it. Each
entry of MDF 121 can store an operand and further includes three valid bits,
i.e. a Prediction Valid (PV) bit, a Prediction Used (PU) bit and a Datum
Valid (DV) bit.

When a cached operand from OC 104 is written into an entry of MDF 121, the
PV bit of that entry is set (thereby informing EU 103 that an operand is
available for the read operation whose tag indexes the entry). If no data is
provided to MDF 121 when the operation is issued on operations bus 110, the
PV bit is cleared. In addition, the entry's PU and DV bits are cleared at
this time. If EU 103 uses the cached operand (which is permissible only if
the HIT signal was asserted when the read operation was issued on operations
bus 110) before the actual operand is returned from MCS 105, then EU 103
sets the PU bit of the entry (thereby recording the fact that the cached
operand was used for speculative execution).

When the actual operand stored in the memory (and the tag of the read
operation that caused this operand to be retrieved from the memory) is
returned from MCS 105 over memory and I/O read data bus 113 to EU 103, EU
103 provides the cached operand, which is stored in the entry of memory data
file 121 indexed by the four least significant bits of the tag, to an input
503 of a comparator 502, writes the actual operand into the entry (thereby
overwriting the cached operand), and sets the entry's DV bit (thereby
recording the fact that the actual operand is now stored in the entry and
thus can be used by EU 103 for non-speculative execution). In addition, the
actual operand is provided to an input 504 of comparator 502. The result of
the comparison between the cached and actual operands is provided to DEC 101
via line 116A, which is part of an EU termination status bus 116 (Fig. 1).
In addition, EU 103 provides DEC 101, via status bus 116, with an indication
of whether the cached operand was used for speculative execution by EU 103
(i.e. the value of the entry's PU bit).

If the VOD bit in the entry of OC 104 for the read instruction is cleared
(which would be the case, for example, if the cached operand address did not
equal the operand address computed by AP 102, as described above), DEC 101
updates the OD field with the actual operand (placed by MCS 105 on memory
and I/O read data bus 113) and sets the VOD bit. In addition, if the VOD bit
in the entry of OC 104 for the read instruction is set and the status
returned from EU 103 to DEC 101 on bus 116 indicates that the cached operand
was not equal to the actual operand stored in the memory, DEC 101 updates
the OD field with the actual operand.

If the status returned from EU 103 to DEC 101 on bus 116 indicates that the
cached operand was not equal to the actual operand stored in the memory AND
that the cached operand was used for speculative execution by EU 103, then
DEC 101 aborts the associated read instruction and all subsequently decoded
instructions. (In other embodiments, processor 100 might not abort
subsequently decoded instructions, or even other operations associated with
the read instruction, that don't depend on correct execution of the read
operation.) To achieve this, DEC 101 commands all execution units (and MCS
105) to stop executing operations having the tag of the oldest operation
issued as a result of decoding the read instruction (not necessarily the
read operation that speculatively and erroneously executed) or a tag of a
younger (i.e. later issued) operation by setting the abort bit of tag status
bus 111 and placing the tag of the oldest operation issued as a result of
decoding the read instruction on tag status bus 111. DEC 101 retrieves the
address of the read instruction from the entry of history RAM 120 indexed by
the four least significant bits of the tag of the operation that executed
erroneously, sets program counter 301 to this address (thereby recommencing
the usual process of prefetching instructions starting with the aborted read
instruction), and starts decoding prefetched instructions and issuing
operations to the execution units, starting with a tag equal to that of the
oldest operation associated with the aborted read instruction. The register
reassignment technique, described above, enables DEC 101 to reset the
register state of processor 100 to that existing before the execution of the
read instruction.

The above description demonstrates that there is a significant cost to
speculatively executing with an incorrect operand from OC 104 (as opposed to
just waiting for the actual operand to be returned from the memory). This
includes the cost of prefetching and decoding the aborted read instruction
and subsequent instructions that would have already been prefetched and
decoded, and perhaps partially or totally executed, by the time the actual
operand was retrieved from the memory if speculative execution of the read
instruction had not been undertaken. It is for this reason, i.e. to minimize
the cost of erroneous and speculative execution while obtaining the benefit
of some speculative and correct execution, that it may be useful to store a
COUNT field (or some similar entity) in each entry of OC 104, as described
above.

If the status returned to DEC 101 on bus 116 indicates that the cached
operand was not equal to the actual operand stored in the memory but that
the cached operand was NOT used by EU 103, then DEC 101 does not abort the
associated instruction because EU 103 will execute (non-speculatively) with
the actual operand which has overwritten the cached operand in MDF 121, as
described above.
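
The life cycle of the PV, PU and DV bits of an MDF entry, and the resulting
abort decision, can be modeled roughly as shown here. This is a behavioral
sketch under assumptions: the class MDFEntry and its method names are
hypothetical, a single entry is modeled rather than the tag-indexed file,
and the return value of return_actual merely stands for the condition under
which DEC 101 would abort (operands unequal AND cached operand used).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MDFEntry:
    value: Optional[int] = None
    pv: bool = False  # Prediction Valid: a cached operand is present
    pu: bool = False  # Prediction Used: it was consumed speculatively
    dv: bool = False  # Datum Valid: the actual operand has arrived

    def issue(self, cached_operand: Optional[int]) -> None:
        """Read operation is issued; a cached operand may accompany it."""
        self.value = cached_operand
        self.pv = cached_operand is not None
        self.pu = False
        self.dv = False

    def use_prediction(self, hit: bool) -> Optional[int]:
        """EU consumes the cached operand early; allowed only under HIT."""
        if self.pv and hit and not self.dv:
            self.pu = True
            return self.value
        return None

    def return_actual(self, actual: int) -> bool:
        """Actual operand arrives from memory; True means an abort is needed."""
        mismatch = self.pv and self.value != actual
        self.value = actual  # overwrite the cached operand
        self.dv = True
        return mismatch and self.pu
```

A mismatch on an entry whose PU bit is clear returns False, reflecting the
case in which execution simply proceeds non-speculatively with the actual
operand.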

DEC 101 increments by one the value in the COUNT field of the appropriate
entry of OC 104 if 1) the status returned by AP 102 on bus 115 indicates
that the cached and calculated operand addresses are equal AND 2) the status
returned by EU 103 on bus 116 indicates that the cached and retrieved (from
the memory) operands are equal. Otherwise (i.e. either the cached and
calculated operand addresses are unequal OR the cached and retrieved
operands are unequal), DEC 101 decrements by one the COUNT field of the
appropriate entry of OC 104. (In some embodiments, the COUNT field could be
reinitialized to a small value, e.g. 0, when the cached and calculated
operand addresses are unequal regardless of the result of the comparison
between the cached and actual operands. In some embodiments, COUNT could be
incremented by one whenever the cached and retrieved operands are equal,
even when the cached and calculated operand addresses are unequal.)

In the alternative embodiment described above where the COUNT field is
replaced by a history bit pattern (HBP) field, the bit (of the HBP field of
the appropriate entry of OC 104) associated with the oldest comparison is
shifted out of the HBP field, and 1 (0) is shifted into the HBP field if the
cached and retrieved operands are equal (unequal). Control logic 303
contains a shift register (not shown) for modifying the HBP field in the
above manner.
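
The shift-register update of the HBP field, together with a lookup in a
32-entry history ROM, might be sketched as follows. Several points are
assumptions made only for illustration: the shift direction (newest result
in the least significant bit), the names update_hbp and HISTORY_ROM, and in
particular the ROM contents, which here predict "correct" when at least
four of the last five comparisons were equal. The text does not specify how
the prediction bits are chosen.

```python
HBP_BITS = 5
HBP_MASK = (1 << HBP_BITS) - 1


def update_hbp(hbp: int, operands_equal: bool) -> int:
    """Shift out the oldest comparison result and shift in the newest one."""
    return ((hbp << 1) | int(operands_equal)) & HBP_MASK


# Hypothetical ROM contents: one prediction bit per 5-bit history pattern.
HISTORY_ROM = [bin(pattern).count("1") >= 4 for pattern in range(1 << HBP_BITS)]


def third_condition_met(hbp: int) -> bool:
    """Third condition in the HBP embodiment: the indexed prediction bit is set."""
    return HISTORY_ROM[hbp]
```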

The appropriate entry of OC 104 whose COUNT field (or HBP field) is to be
modified (as described immediately above) is the entry of OC 104 whose IA
field matches the prefetch address associated with the termination status
returned by EU 103 on bus 116. This prefetch address is retrieved by control
logic 303 from the entry of address table 304 that is indexed by the four
least significant bits of the tag indicated in the termination status.

Modification of the value in the COUNT (or HBP) field occurs even when a
cached operand passed to EU 103 is not used by EU 103 for speculative
execution (either because DEC 101 asserted the THIT signal because of an
insufficiently large value in the COUNT field (or, in the HBP embodiment,
because of the particular bit pattern in the HBP field) or because the
actual operand was provided to EU 103 by MCS 105 before EU 103 used the
cached operand). The value in the COUNT field is updated by either adding or
subtracting (depending on the results of the above comparisons) one with a
saturating adder 305 located in control logic 303, i.e. the COUNT cannot be
incremented (decremented) beyond a certain maximum (minimum) value. The
predetermined threshold of the last of the conditions discussed above (whose
satisfaction is required before an operand from OC 104 will be used for
speculative execution) lies between the maximum and minimum values for
COUNT. In one embodiment where the COUNT field is 3 bits, the maximum and
minimum values for COUNT could be 7 and 0, respectively.
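
The saturating update of the COUNT field can be expressed in a few lines.
The sketch assumes the 3-bit embodiment with limits 0 and 7; the function
name update_count and its parameters are hypothetical.

```python
def update_count(count: int, addresses_equal: bool, operands_equal: bool,
                 minimum: int = 0, maximum: int = 7) -> int:
    """Saturating increment or decrement of a COUNT value.

    COUNT is incremented only when both the address comparison (AP) and the
    operand comparison (EU) report equality; otherwise it is decremented.
    """
    if addresses_equal and operands_equal:
        return min(count + 1, maximum)
    return max(count - 1, minimum)
```

Because the result is clamped, a long run of correct predictions cannot push
COUNT past the maximum, so a few mispredictions are enough to bring it back
below the threshold.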

A new entry in OC 104 is created whenever OC 104 does not have an entry
whose IA field matches the address of a read instruction requiring an
operand. If possible, an entry whose VIA bit is cleared is selected for the
new entry. If no such entry exists, then an entry with the smallest value in
its COUNT field could be chosen. (Other replacement algorithms will be
obvious to one of ordinary skill in the art.) In the selected entry, the VOA
and VOD bits are cleared, the VIA bit is set, and the address of the read
instruction is written into the IA field. Later, when AP 102 generates the
operand address, the operand address (placed on address bus 108 by AP 102)
is written into the OA field of the new entry, and the new entry's VOA bit
is set. When MCS 105 returns (on memory and I/O read data bus 113) the
actual operand stored in the memory, the actual operand is written into the
new entry's OD field, the new entry's VOD bit is set, and the new entry's
COUNT field is set to an initial value (such as zero).
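
The entry selection and initialization steps described above can be sketched
as follows. The function name allocate_entry and the dictionary keys are
hypothetical, and ties between entries with equal COUNT values are broken
here by taking the first such entry, which is an arbitrary choice not
specified in the text.

```python
def allocate_entry(entries: list[dict], read_instruction_address: int) -> int:
    """Select and initialize an OC entry for a new read instruction.

    Each entry is a dict with keys "ia", "via", "voa", "vod" and "count"
    (illustrative layout). Returns the index of the selected entry.
    """
    # Prefer an entry whose VIA bit is cleared.
    victim = next((i for i, e in enumerate(entries) if not e["via"]), None)
    if victim is None:
        # Otherwise replace an entry with the smallest COUNT value.
        victim = min(range(len(entries)), key=lambda i: entries[i]["count"])
    # Clear VOA and VOD, set VIA, and record the instruction address. The OA
    # and OD fields are filled in later, once AP and MCS supply the values.
    entries[victim].update(ia=read_instruction_address, via=True,
                           voa=False, vod=False)
    return victim
```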

The particular bytes of the 4-byte OD field into which the actual operand is
written are determined by the 2 LSBs of the operand address and the size of
the operand. For example, if the 2 LSBs of the address of a 2-byte operand
are 10, then the actual operand will be written into the third and fourth
bytes of the OD field. As discussed above, in the described embodiment OC
104 does not store operands that straddle a 4-byte boundary. In other
embodiments, the size of the OD field could be different. For example, if
the OD field holds 8 bytes, then the particular bytes of the OD field into
which the actual operand is written are determined by the 3 LSBs of the
operand address and the size of the operand.

The specific embodiment described above is useful for caching operands for
instructions that are decoded into one or more operations, exactly one of
which requires unlocked access to a cacheable memory operand. This
embodiment could be modified (in ways that are obvious to one of ordinary
skill in the art) to provide an operand cache each of whose entries can
store more than one operand. Such a cache could be used for speculative
execution of multiple read access instructions, i.e. instructions that are
decoded into two or more operations each of which requires an operand stored
in the memory. For example, in one embodiment, each entry of OC 104 could
contain one IA field, one VIA field, 2 OA fields (OA1 and OA2), 2 OD fields
(OD1 and OD2), 2 VOA fields (VOA1 and VOA2), 2 VOD fields, and 2 COUNT
fields (COUNT1 and COUNT2). The parts of OC 104 for storing the OA1 and OA2
fields, respectively, would each be fully associative, so that an address at
which the memory will be written with a particular value could be compared
against the OA1 and OA2 fields of each entry in order to update the OD1 and
OD2 fields associated with the matching OA1 and OA2 fields, respectively,
with the particular value.

System Embodiments

A processor containing an operand cache in accordance with this 
invention may be incorporated into a wide variety of system configurations, 
illustratively into standalone and networked personal computer systems, 
workstation systems, multimedia systems, network server systems, 
multiprocessor systems, embedded systems, integrated telephony systems, 
video conferencing systems, etc. Figs. 6-8 depict an illustrative set of 
suitable system configurations for processor 100 that contains operand cache 
104 (Fig. 1).

In particular, Figs. 6-8 depict suitable combinations of a processor
containing an operand cache in accordance with this invention with suitable
bus configurations, memory hierarchies and cache configurations, I/O
interfaces, controllers, devices, and peripheral components.

The set of system configurations depicted in Figs. 6-8 is merely
illustrative and alternative combinations of bus configurations, memory
hierarchies, I/O interfaces, controllers, devices, and peripheral components
are also suitable. For example, suitable configurations for a system
incorporating processor 100 include combinations of components, cards,
interfaces, and devices such as:

1. video display devices, monitors, flat-panel displays, and touch screens;

2. pointing devices and keyboards;

3. coprocessors, floating point processors, graphics processors, I/O
controllers, and UARTs;

4. secondary and tertiary storage devices, controllers, and interfaces,
caches, RAM, ROM, flash memory, static RAM, dynamic RAM;

5. CD-ROMs, fixed disks, removable media storage devices, floppy disks,
WORMs, IDE controllers, enhanced-IDE controllers, SCSI devices, scanners and
jukeboxes;

6. PCMCIA interfaces and devices, ISA busses and devices, EISA busses and
devices, PCI local busses and devices, VESA local busses and devices, Micro
Channel Architecture busses and devices;

7. network interfaces, adapters and cards such as for ethernet, token ring,
10Base-T, twisted pairs, untwisted pairs, ATM networks, frame-relay, ISDN,
etc.;

8. video cards and devices, 2-D and 3-D graphics cards, frame buffers,
MPEG/JPEG compression/decompression logic and devices, videoconferencing
cards and devices, and video cameras and frame capture devices;

9. computer integrated telephony cards and devices, modem cards and devices,
fax cards and devices;

10. sound cards and devices, audio and video input devices, microphones, and
speakers;

11. data acquisition and control cards and interfaces,
compression/decompression logic and devices, encryption/decryption logic and
devices; and

12. tape backup units, redundant/fault tolerant components and devices such
as RAID and ECC memory.

Suitable combinations of such components, cards, interfaces, and devices
(including those enumerated above as well as comparable components, cards,
interfaces, and devices) are too numerous to list. However, those skilled in
the art will appreciate the full set of suitable combinations and will
recognize suitable couplings between such components, cards, interfaces, and
devices. Figs. 6-8 are illustrative of an exemplary subset of the full set
of suitable combinations.

Fig. 6 shows a networked personal computer incorporating processor 100.
Alternative embodiments include a cache or caches interposed between memory
109 and processor 100. Control logic and storage for such a cache may be
located on or off processor 100. For example, level 1 caches (i.e.
instruction cache and data cache) and cache control logic may be included in
processor 100 and a level 2 cache may be present outside processor 100 (e.g.
level-two cache 106 of Fig. 1). Alternative distributions are also suitable,
although the level 1 caches are preferably on-chip with processor 100.

In the embodiment of Fig. 6, processor 100 and memory 109 are included as
parts of motherboard 1033. A series of adapters, interfaces and controllers
couple the processor to devices and peripheral components. These adapters,
interfaces and controllers are typically coupled to the processor as cards
in a backplane bus of motherboard 1033. However, alternative embodiments may
incorporate individual adapters, interfaces and controllers into motherboard
1033. For example, a graphics adapter 1010 may be included on motherboard
1033 with processor 100. In either case, graphics adapter 1010 is coupled to
processor 100 via busses such as those described below with reference to the
subsequent figures. Graphics adapter 1010 drives signals to control a
display 1001 in accordance with screen updates supplied by processor 100. A
parallel interface 1009 and a serial interface 1008 provide parallel port
and serial port signaling interfaces for respectively interfacing to
parallel port devices (e.g., printers such as a parallel printer 1002, tape
backup units, etc.) and to serial devices (e.g., a modem 1003, pointing
devices, and printers). In the embodiment of Fig. 6, parallel interface 1009
and serial interface 1008 are shown as separate interfaces although each is
often incorporated with a hard disk/floppy disk controller (such as 1030) as
a multifunction card. Hard disk/floppy disk controller 1030 controls access
to the media of a hard disk 1032 and to a floppy disk 1031. Typically, hard
disk/floppy disk controllers such as hard disk/floppy disk controller 1030
provide some level of buffering of reads and writes. Hard disk/floppy disk
controller 1030 may also provide limited caching for data transfers to and
from the disk media.

Suitable designs for graphics adapter 1010, parallel interface 1009, serial
interface 1008, and hard disk/floppy disk controller 1030 are well known in
the art. For example, implementations of graphics adapter cards conforming
to the VGA standard are commonly available and suitable designs are well
known to those skilled in the art. Designs for parallel and serial
interfaces, such as those conforming to the Centronics parallel interface
and to the RS-232C serial interface specifications, respectively, are also
well known to those skilled in the art. Similarly, designs for IDE and SCSI
disk controllers are well known in the art and suitable implementations are
commonly available. In each case, graphics adapter 1010, parallel interface
1009, serial interface 1008, and hard disk/floppy disk controller 1030 are
of any such suitable design.

Finally, a LAN adapter 1007 provides a network interface to local area
networks such as 802.3 ethernet, 10base-T, twisted pair, and token ring
networks. As with the other adapters and interfaces, LAN adapter 1007 is
typically coupled to the processor as a card in the backplane bus of
motherboard 1033. However, alternative embodiments may incorporate LAN
adapter 1007 into motherboard 1033. Suitable cards and devices providing
network interfaces are well known in the art and LAN adapter 1007 is any
such suitable card or device.

In the network server configuration of Fig. 7, multiple instances of
processor 100 including operand cache 104 are shown coupled to a level 2
cache 106 and to a processor bus 2027. In the embodiment of Fig. 7,
processor 100 includes control logic (part of MCS 105 of Fig. 1) for cache
106. The cache control logic (not shown) is coupled to cache 106 via a
64-bit cache bus. Alternative embodiments of processor 100 may offload the
functionality of control logic for cache 106. In such an alternative
embodiment, the cache control logic may be interposed between processor 100
and level 2 cache 106. In the context of bus structures presented in Fig. 7,
the cache control logic could be coupled to processor 100 via processor bus
2027. Suitable modifications to the cache configuration of Fig. 7 (such as
providing a cache in processor 100) will be apparent to those skilled in
the art.

Referring again to Fig. 7, processor 100 is coupled to a memory controller
2002 and to a system controller 2005 via a 64-bit processor bus 2027. Memory
controller 2002 provides a 64-bit interface to memory 109 including an 8-bit
parity interface to support Error Correcting Codes (ECC). ECC memory is
desirable, but optional, and alternative embodiments may forgo the parity
interface. System controller 2005 provides the interface (or bridge) between
the 64-bit processor bus 2027 and the 32-bit local bus 2009. Local bus 2009
is any high-speed I/O bus, for example, a VESA Local bus (VL bus) or
Peripheral Component Interconnect (PCI) bus. System controller 2005 provides
buffering to support the potentially disparate clock rates of processor bus
2027 and local bus 2009. System controller 2005 arbitrates for use of the
two busses (2027 and 2009) and may, in certain configurations, support burst
data transactions across the two busses. Suitable designs for interbus
bridges, such as system controller 2005 (bridging processor bus 2027 and
local bus 2009) and a bridge and peripheral controller 2006 (bridging local
bus 2009 and ISA bus 2010, as described below) are well known in the art.
For example, U.S. Patent No. 5,414,820, "Crossing Transfers for Maximizing
the Effective Bandwidth of a Dual-Bus Architecture," to McFarland et al.,
the entirety of which is incorporated herein by reference, describes a
design suitable for bridging a high-speed system bus and a slower I/O bus.
System controller 2005 and bridge and peripheral controller 2006 are of any
such suitable design.

Local bus 2009 couples to multiple local bus devices and components
(illustratively, to an IDE controller 2008, a SCSI adapter 2018, a LAN
adapter 2019, and bridge and peripheral controller 2006). Certain of the
local bus devices and components on local bus 2009 may optionally be
provided as cards coupled to the local bus 2009 by a modular connector. In
the embodiment of Fig. 7, IDE controller 2008, SCSI adapter 2018, and LAN
adapter 2019 are provided as cards coupled to the local bus 2009 by a
modular connector. Bridge and peripheral controller 2006 is directly
connected to the local bus 2009. Alternative configurations (including
configurations in which one or more of the IDE controller 2008, SCSI adapter
2018, and LAN adapter 2019 are directly connected to local bus 2009) are
also suitable and will be appreciated by those skilled in the art. In
addition, alternative embodiments may couple a display adapter to local bus
2009 thereby taking advantage of the generally higher bandwidth and
throughput of local bus 2009 for screen updates (when compared to
alternatives such as ISA, EISA, and Micro Channel Architecture busses).
Because display device requirements are typically less demanding in network
server configurations than in personal computer or workstation
configurations, display adapter 2020 is shown coupled to the lower bandwidth
ISA bus 2010.

IDE controller 2008 is representative of a variety of controller designs
(including IDE, enhanced IDE, ATA, and Enhanced Small Device Interface
(ESDI) controller designs) for interfacing storage devices such as disks,
tape drives, and CD-ROMs. IDE controller 2008 is coupled to two disks (hard
disk 2011 and floppy disk 2012) and to a tape backup unit 2013. Alternative
configurations may interface an IDE/enhanced IDE CD-ROM via IDE controller
2008, although both a CD-ROM 2015 and a CD jukebox 2017 are interfaced via
SCSI adapter 2018 in the embodiment of Fig. 7. Suitable designs for hard
disks, floppy disks, CD-ROMs, and tape drives are all well known in the art
and modular components based on those designs are commonly available for
IDE, enhanced IDE, and ATA based controller designs. IDE controller 2008 is
of any such suitable design, including enhanced IDE, ATA, and ESDI
alternatives.

SCSI adapter 2018 is coupled to local bus 2009 and to multiple SCSI devices
(illustratively, to a Redundant Array of Inexpensive Disks (RAID) 2014,
CD-ROM 2015, scanner 2016, and CD jukebox 2017) in a daisy chain
configuration. For illustrative purposes, the daisy chain of SCSI devices is
shown as a bus in Fig. 7. Additional SCSI devices may also be coupled to
SCSI adapter 2018 and additional SCSI adapters may be coupled to local bus
2009 to provide even larger numbers of SCSI device connections.
Additionally, SCSI adapter 2018 and/or additional SCSI adapters may be
coupled to an Industry Standard Architecture (ISA) bus such as ISA bus 2010,
although coupling to a local bus such as local bus 2009 is generally
preferable because of the higher bandwidth and throughput of local busses
conforming to standards such as the VL bus or PCI standards.

In addition to the set of SCSI devices shown in Fig. 7, additional hard
disks, printers, LAN adapters, and other computer systems may be coupled to
processor 100 via a SCSI adapter such as SCSI adapter 2018. Additionally,
SCSI adapter 2018 is representative of suitable alternative device adapters
such as SCSI-2 and ESDI adapters. Suitable designs for RAIDs, scanners,
CD-ROM jukeboxes, hard disks, CD-ROMs, printers, LAN adapters and tape
drives are all well known in the art and modular components based on those
designs are commonly available for SCSI adapter designs. SCSI adapter 2018
is of any such suitable design, including SCSI-2 and ESDI alternatives.

LAN adapter 2019 is coupled to local bus 2009 and, in the embodiment of
Fig. 7, provides support for an IEEE 802.3 Carrier Sense Multiple Access
with Collision Detection (CSMA/CD) local area network, although adapters for
alternative network configurations and for media variations of an 802.3
network are also suitable. LAN adapter 2019 is therefore representative of
suitable alternative device adapters such as those based on IEEE 802.x
standards (e.g., 802.3 baseband ethernet on coaxial media, twisted and
untwisted pair media, and 10base-T, 802.3 broadband networks, 802.4 token
passing networks, 802.5 token ring networks, etc.), and those based on Fiber
Distributed Data Interface (FDDI) standards. Designs for such suitable
network adapters are well known in the art and modular components based on
those designs are commonly available for both VL bus and PCI bus
connections. In addition, suitable designs for network adapters with ISA,
SCSI, and SCSI-2 interfaces are also well known in the art and modular
components based on those designs are also commonly available. Alternative
embodiments may therefore incorporate LAN adapters such as LAN adapter 2019
coupled to processor 100 via ISA bus 2010 or SCSI adapter 2018, although
coupling to a local bus such as local bus 2009 is generally preferable to
the ISA bus alternative because of the higher bandwidth and throughput of
local busses conforming to standards such as the VL bus or PCI standards.
LAN adapter 2019 is of any suitable design, for any suitable network
topology and medium, and is coupled to any of the suitable bus structures
(e.g., VL bus, PCI bus, ISA bus, SCSI, etc.).

ISA bus 2010 is coupled to local bus 2009 via bridge and peripheral
controller 2006. Suitable bridges, like system controller 2005, are well
known in the art, and bridge and peripheral controller 2006 is of any
suitable design. ISA bus 2010 provides a lower-speed (when compared to local
bus 2009), 16-bit I/O bus and provides modular connections for a variety of
peripheral components including display adapter 2020, telephony card 2026,
and a multifunction I/O card such as super I/O 2028. Display adapters such
as display adapter 2020 are well known in the art and provide varying
degrees of support for advanced graphics functions. For example, simple text
display adapters provide text and character based graphics only. More
sophisticated display adapters, such as those implementing SVGA, XGA, VESA,
CGA, and Hercules graphics standards provide multibit color and higher
display resolutions. Specialized display adapters may provide more advanced
features, such as hardware support for 24-bit color, 3-D graphics, hidden
surface removal, lighting models, Gouraud shading, depth cueing, and texture
mapping. As described above, display device requirements have typically been
less demanding in network server configurations than in personal computer or
workstation configurations. As a result, display adapter 2020 is shown
coupled to the relatively low bandwidth ISA bus 2010. However, alternative
embodiments may couple an advanced or specialized display adapter to local
bus 2009 thereby taking advantage of the generally higher bandwidth and
throughput of local bus 2009 for screen updates (when compared to
alternatives such as ISA, EISA, and Micro Channel Architecture busses).

Super I/O 2028 provides support for a serial port 2022, a parallel port
2023, a pointing device 2024, and a disk 2025. Suitable designs for
combination ISA cards such as super I/O 2028 are well known in the art and
such cards are commonly available. Super I/O 2028 is of any such suitable
design. Modems may be coupled via serial port 2022 and printers may be
coupled via either the serial port 2022 or parallel port 2023 provided by
super I/O 2028. Alternatively, a single function ISA card may be used for
such purposes. Modem and fax/modem cards are one example of such a single
function card. Telephony card 2026 is representative of cards providing
voice, fax, and speech recognition, modem, fax-on-demand services, etc.
Suitable telephony cards typically conform to standards defining a modular
architecture for integrating computer-based applications with telephony
hardware. These standards include the Communicating Applications
Specification (CAS) and the more comprehensive Signal Computing System
Architecture (SCSA) standard. Telephony card 2026 is of any such suitable
design.

Preferably, a high performance server configuration, such as that shown in
Fig. 7, includes a hierarchy of busses with varying performance
characteristics each matched to the devices and components coupled thereto.
Those skilled in the art will recognize a variety of suitable variations on
the bus hierarchy of Fig. 7, including the elimination of individual busses,
the addition of multiple instances of individual busses, and redistribution
of devices and components among the various busses. The server configuration
of Fig. 7 is representative of all such suitable variations.
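
The three-level bus hierarchy of Fig. 7 can be summarized as a small data
model. The following is only an illustrative sketch: the Bus class and the
path_to helper are hypothetical names introduced here, and only the bus
widths, reference numerals, and couplings stated in the description above
are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Bus:
    name: str
    ref: int         # reference numeral used in Fig. 7
    width_bits: int  # data width stated in the description
    devices: List[str] = field(default_factory=list)
    # Each child is a (bridge name, lower-level bus) pair.
    children: List[Tuple[str, "Bus"]] = field(default_factory=list)


isa_bus = Bus("ISA bus", 2010, 16,
              ["display adapter 2020", "telephony card 2026", "super I/O 2028"])
local_bus = Bus("local bus", 2009, 32,
                ["IDE controller 2008", "SCSI adapter 2018", "LAN adapter 2019"],
                [("bridge and peripheral controller 2006", isa_bus)])
processor_bus = Bus("processor bus", 2027, 64,
                    ["processor 100", "memory controller 2002"],
                    [("system controller 2005", local_bus)])


def path_to(bus: Bus, device: str, trail: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Return the busses and bridges crossed from `bus` down to `device`."""
    trail = trail + (bus.name,)
    if device in bus.devices:
        return trail
    for bridge, child in bus.children:
        found = path_to(child, device, trail + (bridge,))
        if found is not None:
            return found
    return None
```

Calling path_to(processor_bus, "super I/O 2028") walks through both bridges,
which mirrors the point made above that lower-bandwidth devices sit further
from processor 100 in the hierarchy.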

A multimedia workstation configuration for processor 100 is shown in Fig. 8.
As with the server configuration of Fig. 7, the multimedia workstation
configuration includes a hierarchy of busses with varying performance
characteristics each matched to the devices and components coupled thereto.
Those skilled in the art will recognize a variety of suitable variations on
the bus hierarchy of Fig. 8. A memory bus 3002 couples processor 100, a
cache 3001, memory 109, and a bridge 3004. The instructions and their
operands are stored in cache 3001 and memory 109. As with the network server
configuration of Fig. 7, a variety of cache configurations are suitable for
a multimedia workstation. Cache 3001 including control logic is coupled to
processor 100 via memory bus 3002. Alternative embodiments of processor 100
(such as that shown in Fig. 1) may incorporate functionality of the control
logic for cache 3001, thereby enabling a direct connection to the storage of
cache 3001. Suitable modifications to the cache configuration of Fig. 8
(such as providing a cache in processor 100) will be apparent to those
skilled in the art.

An I/O bus 3005 is comparable to local bus 2009 of Fig. 7 and is preferably
a high speed I/O bus such as a VL bus or PCI bus. A SCSI adapter 3006, a LAN
adapter 3007, an expansion bus bridge 3008, a graphics adapter 3009, a sound
adapter 3024, and a motion video adapter 3021 are coupled to each other and
to processor 100 via I/O bus 3005. SCSI adapter 3006, LAN adapter 3007, and
expansion bus bridge 3008, together with the components and devices coupled
to each, are comparable to corresponding adapters, components, and devices
discussed above with reference to Fig. 7.

In particular, SCSI adapter 3006 is coupled to multiple SCSI devices
(illustratively, a disk 3011, a tape backup unit 3012, and a CD-ROM 3013) in
a daisy chain configuration. For illustrative purposes, the daisy chain of
SCSI devices is shown as a bus. Additional SCSI devices may also be coupled
to SCSI adapter 3006. Suitable designs for SCSI adapter 3006 are discussed
above with reference to the comparable SCSI adapter 2018 of Fig. 7.
Variations on the set of SCSI devices, and on SCSI configurations in general
(each of which has been described above with reference to Fig. 7) are also
applicable in the multimedia workstation configuration of Fig. 8. Similarly,
suitable designs and variations on LAN adapter 3007 are also described above
in the context of the comparable LAN adapter 2019 (see Fig. 7). Furthermore,
suitable designs and variations on expansion bus 3017 are described above in
the context of the comparable ISA bus 2010 (see Fig. 7). As described above,
suitable designs for SCSI adapter 2018 and ISA bus 2010 are well known in
the art and modular components based on such suitable designs are commonly
available. SCSI adapter 3006, LAN adapter 3007, and expansion bus 3017
(together with the components and devices coupled thereto) are comparable.
SCSI adapter 3006, LAN adapter 3007, expansion bus bridge 3008, and
expansion bus 3017 are therefore also of any such suitable designs.

Referring to Fig. 8, multimedia adapters, such as a sound adapter 3024, a
motion video adapter 3021, and a graphics adapter 3009, are each coupled to
processor 100 via I/O bus 3005 and memory bus 3002 to provide for
high-bandwidth transfers of multimedia data between the multimedia adapters,
memory 109, and secondary storage devices (e.g., disk 3011). Sound adapter
3024 provides digital-to-analog (D/A) and analog-to-digital (A/D) interfaces
for respectively synthesizing and sampling audio signals. The D/A and A/D
interfaces of sound adapter 3024 are respectively coupled to an audio
performance device, such as a speaker 3026, and an audio signal acquisition
device, such as a microphone 3025. Other suitable audio performance devices
include mixing consoles, signal processing devices, synthesizers, MIDI
sequencers and power amplifiers. Other suitable audio signal acquisition
devices include signal processing devices and digital samplers. Suitable
designs for sound cards are well known in the art and sound adapter 3024 is
of any such suitable design.

Motion video adapter 3021 provides support for capture and compression of
video signals, for example, from a video camera 3020. In addition, motion
video adapter 3021 supplies a display device 3023 such as a television,
high-definition television, or a high resolution computer monitor with
display signals via a frame buffer 3022. Alternative embodiments of motion
video adapter 3021 may eliminate the frame buffer 3022 and directly drive a
raster display. Furthermore, alternative embodiments of motion video adapter
3021 may decouple the video input and video output functionality of motion
video adapter 3021, and instead provide separate video input and video
output components.

Because video information requires large amounts of storage space, it is
generally compressed. Accordingly, to display compressed video information,
for example from data represented on a compact disk in CD-ROM 3013, the
compressed video information must be decompressed. High bandwidth burst mode
data transfers are supported by I/O bus 3005, which is preferably a local
bus such as PCI with support for arbitrary length burst data transfers. In
the multimedia workstation configuration of Fig. 8, video compression and
decompression can be performed by processor 100 and/or by motion video
adapter 3021. Thus, memory bus 3002 and bridge 3004 preferably support burst
data transfers across the bridge (3004) between memory bus 3002 and I/O bus
3005. Suitable designs for motion video adapters typically provide support
for the Moving Picture Experts Group (MPEG) standards for video encoding and
decoding (e.g., MPEG-1 and MPEG-2) and for JPEG. In addition, motion video
adapter 3021 may support video conferencing by implementing video
compression/decompression algorithms in accordance with H.261 (the standard
compression algorithm for H.320 videoconferencing). Suitable designs for
implementing such compression/decompression algorithms are well known in the
art and motion video adapter 3021 is of any such suitable design.

Graphics adapters such as graphics adapter 3009 are well known in the art
and provide varying degrees of support for advanced graphics functions. For
example, graphics adapters, such as those implementing SVGA, XGA, VESA, CGA,
and Hercules graphics standards provide multibit color and higher display
resolutions. Specialized display adapters may provide more advanced
features, such as hardware support for 24-bit color, 3-D graphics, hidden
surface removal, lighting models, Gouraud shading, depth cueing, and texture
mapping. Suitable designs for graphics adapters are well known in the art
and modular components based on these designs are commonly available.
Graphics adapter 3009 is of any such suitable design. Alternative
embodiments may combine the graphics display functionality of graphics
adapter 3009 with the motion video display functionality of motion video
adapter 3021 outputting on a single high-resolution display device.

Conclusion 

While the above is a description of a specific embodiment of the invention,
various alternatives, modifications, and equivalents may be used. For
example, as discussed above, in alternative embodiments the part of the
operand cache storing the IA fields could be a set-associative or
direct-mapped structure. In addition, the OA field (used to facilitate
coherence between the memory and the operand cache) could be eliminated, at
the cost of a higher frequency of speculative executions using incorrect
operands, as discussed above. In some embodiments, each entry of the operand
cache could store more than one operand for an instruction. Therefore, the
above description should not be taken as limiting the scope of the invention
which is defined by the appended claims.
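
The interplay of the IA field, the OA field, and speculative use of a cached
operand can be illustrated with a short software model. This is a minimal
sketch under stated assumptions, not the circuit of the embodiment: it
assumes a direct-mapped organization, a 16-entry default size, a confidence
count that saturates at zero, a default threshold of 1, and little-endian
byte order within a 4-byte operand. The names Entry, OperandCache, predict,
verify, and snoop_write are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    ia: int         # IA field: address of the instruction that owns the entry
    oa: int         # OA field: memory address the operand was last read from
    value: int      # cached operand, modeled here as a 4-byte unsigned value
    count: int = 0  # confidence count (saturation at zero is an assumption)


class OperandCache:
    """Toy direct-mapped operand cache; sizes and threshold are assumptions."""

    def __init__(self, n_entries: int = 16, threshold: int = 1) -> None:
        self.threshold = threshold
        self.entries: List[Optional[Entry]] = [None] * n_entries

    def _slot(self, ia: int) -> int:
        # Direct-mapped index taken from the instruction address.
        return ia % len(self.entries)

    def predict(self, ia: int) -> Optional[int]:
        """Return a speculative operand for the instruction at ia, or None."""
        e = self.entries[self._slot(ia)]
        if e is not None and e.ia == ia and e.count > self.threshold:
            return e.value
        return None

    def verify(self, ia: int, oa: int, actual: int) -> bool:
        """Check the cached operand against the one read from memory.

        Returns True if the cached operand equals the actual operand. A False
        result means any execution that used the cached operand must abort.
        """
        slot = self._slot(ia)
        e = self.entries[slot]
        if e is None or e.ia != ia:
            # No entry for this instruction yet: allocate one (assumed policy).
            self.entries[slot] = Entry(ia, oa, actual)
            return False
        equal = e.value == actual
        e.count = e.count + 1 if equal else max(e.count - 1, 0)
        e.oa, e.value = oa, actual  # no-op when both already match
        return equal

    def snoop_write(self, addr: int, data: bytes) -> None:
        """Mirror a store to memory into every entry whose OA field matches."""
        for e in self.entries:
            if e is None or (e.oa >> 2) != (addr >> 2):
                continue  # the upper 30 address bits differ: no match
            raw = bytearray(e.value.to_bytes(4, "little"))
            for i, b in enumerate(data):
                lane = (addr & 3) + i  # byte lane from the two low address bits
                if lane < 4:
                    raw[lane] = b
            e.value = int.from_bytes(raw, "little")
```

In this model a read instruction first calls predict with its own address
and, if a value comes back, executes with it speculatively; once the real
operand arrives from memory, verify decides whether that execution stands.
Dropping snoop_write corresponds to eliminating the OA field: the model
still works, but stores leave stale operands behind, so verify fails more
often.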
WHAT IS CLAIMED:

1. A method comprising the steps of:

retrieving an instruction from a memory, the instruction being stored at a
first address in the memory;

determining that execution of the instruction requires execution of an
operation in an execution unit of a processor, the operation requiring a
first operand that is stored in the memory;

retrieving a second operand from an entry of an operand cache, wherein the
entry corresponds to the first address; and

executing the operation in the execution unit using the second operand as a
substitute for the first operand.
2. The method of claim 1, further comprising the steps of:

determining that the first operand is stored in the memory at a second
address;

retrieving the first operand from the memory;

comparing the first and second operands; and

if the first and second operands are not equal, aborting execution of the
operation.

3. The method of claim 1, wherein the operand cache includes first and
second memories, each entry of the first memory can store an instruction
address, each entry of the second memory can store an operand and
corresponds to a respective entry of the first memory, each entry of the
operand cache includes a respective entry of the first memory and an entry
of the second memory corresponding to the respective entry of the first
memory, and the first memory is fully associative.

4. The method of claim 1, wherein the operand cache includes first and
second memories, each entry of the first memory can store part of an
instruction address, each entry of the second memory can store an operand
and corresponds to a respective entry of the first memory, each entry of the
operand cache includes a respective entry of the first memory and an entry
of the second memory corresponding to the respective entry of the first
memory, and the first memory is set-associative.

5. The method of claim 1, wherein the operand cache includes first and
second memories, each entry of the first memory can store part of an
instruction address, each entry of the second memory can store an operand
and corresponds to a respective entry of the first memory, each entry of the
operand cache includes a respective entry of the first memory and an entry
of the second memory corresponding to the respective entry of the first
memory, and the first memory is direct-mapped.

6. The method of claim 2, further comprising the step of replacing the
second operand in the entry of the operand cache corresponding to the first
address with the first operand, if the first and second operands are not
equal.

7. The method of claim 2, further comprising the steps of:

retrieving an operand address from the entry of the operand cache
corresponding to the first address;

comparing the operand address and second address; and

if the operand address and second address are unequal, replacing the operand
address in the entry of the operand cache corresponding to the first address
with the second address.

8. A method comprising the steps of:

retrieving an instruction from a memory, the instruction being stored at a
first address in the memory;

determining that execution of the instruction requires execution of an
operation in an execution unit of a processor, the operation requiring a
first operand that is stored in the memory;

retrieving a second operand from an entry of an operand cache, wherein the
entry corresponds to the first address;

executing the operation in the execution unit using the second operand as a
substitute for the first operand, only if a value stored in the entry of the
operand cache corresponding to the first address satisfies a condition, the
value giving an indication of the likelihood of the first and second
operands being equal;

determining that the first operand is stored in the memory at a second
address;

retrieving the first operand from the memory;

executing the operation in the execution unit using the first operand, if
the value does not satisfy the condition;

comparing the first and second operands; and

if the first and second operands are not equal and the step of executing the
operation in the execution unit using the second operand as a substitute for
the first operand was performed, aborting execution of the operation.

9. The method of claim 8, wherein:

the value that gives an indication of the likelihood of the first and second
operands being equal is an integer count; and

the condition is that the value exceeds a predetermined threshold.

10. The method of claim 9, further comprising the steps of:

if the first and second operands are unequal, decrementing the count; and

if the first and second operands are equal, incrementing the count.

11. The method of claim 8, wherein:

the value that gives an indication of the likelihood of the first and second
operands being equal is a history bit pattern.

12. A method of maintaining consistency between an operand cache and a
memory, each entry of the operand cache including an operand address field
capable of storing an operand address in the memory and an operand datum
field capable of storing an operand value, the method comprising the steps
of:

writing a particular value occupying a number of bytes to the memory at a
particular address; and

writing the particular value into the operand datum field of each entry of
the operand cache whose operand address field matches the particular
address.

13. The method of claim 12, wherein:

the operand address field of each entry of the operand cache stores a 32-bit
byte address;

the operand datum field of each entry of the operand cache can store up to a
4-byte value;

the particular address is a 32-bit byte address;

the particular address matches the operand address field of a particular
entry of the operand cache if and only if the particular address and the
operand address stored in the operand address field at the particular entry
share the same 30 most significant bits; and

the specific bytes of the operand datum field of a matching entry of the
operand cache into which the particular value is written are determined by
the number of bytes occupied by the particular value and by the two least
significant bits of the particular address.



14. A method of maintaining consistency between an operand cache and a
memory, each entry of the operand cache including a plurality of operand
address fields and a corresponding plurality of operand datum fields, each
of the operand address fields being capable of storing an operand address in
the memory and each of the operand datum fields being capable of storing an
operand value, the method comprising the steps of:

writing a particular value to the memory at a particular address; and

writing the particular value into each operand datum field of the operand
cache whose corresponding operand address field matches the particular
address.



15. A method comprising the steps of:

(a) retrieving an instruction from a memory, the instruction being stored at
a first address in the memory;

(b) determining that execution of the instruction requires execution of
first and second operations in an execution unit of a processor, the first
operation requiring a first operand that is stored in the memory and the
second operation requiring a second operand that is stored in the memory;

(c) retrieving third and fourth operands from an entry of an operand cache,
wherein the entry corresponds to the first address; and

(d) executing the first and second operations in the execution unit using
the third and fourth operands, respectively, as substitutes for the first
and second operands, respectively.

16. An apparatus for executing an instruction stored at a first address in a
memory, execution of the instruction including execution in an execution
unit of an operation that specifies a first operand to be read from the
memory, the apparatus comprising:

an operand cache including a plurality of entries, the plurality of entries
including a first entry corresponding to the first address and storing a
second operand;

means for using the first address to access the first entry, for retrieving
the second operand and for passing the second operand to the execution unit;
and

means for executing the operation in the execution unit using the second
operand as a substitute for the first operand.

17. The apparatus of claim 16, wherein the means for using the first address
to access the first entry comprise a fully associative structure.

18. The apparatus of claim 16, wherein the means for using the first address
to access the first entry comprise a set-associative structure.

19. The apparatus of claim 16, wherein the means for using the first address
to access the first entry comprise a direct-mapped structure.

20. The apparatus of claim 16, further comprising:

means for determining that the first operand is stored in the memory at a
second address;

means for retrieving the first operand from the memory;

means for comparing the first and second operands; and

means for aborting execution of the operation, if the first and second
operands are not equal.
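
The compare-and-abort behaviour of claim 20 can be sketched as follows. The
function name `speculate_and_verify`, the use of `None` to signal an abort,
and the modelling of the operation as a Python callable are assumptions
made for illustration only.

```python
from typing import Callable, Dict, Optional


def speculate_and_verify(operation: Callable[[int], int],
                         cached_operand: int,
                         memory: Dict[int, int],
                         operand_addr: int) -> Optional[int]:
    """Run `operation` on the cached operand, then check it against memory.

    Returns the speculative result if the cached operand was correct, or
    None if execution has to be aborted.
    """
    # Start early with the cached ("second") operand.
    speculative_result = operation(cached_operand)
    # Meanwhile the real ("first") operand is read from the second address.
    actual_operand = memory[operand_addr]
    if actual_operand != cached_operand:
        return None  # operands differ: abort the speculative execution
    return speculative_result
```

What happens after an abort (for example, re-executing with the operand
read from memory) is outside what this sketch models.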

21. The apparatus of claim 20, wherein:

the entry of the operand cache corresponding to the first address stores a
value that gives an indication of the likelihood of the first and second
operands being equal.

22. The apparatus of claim 21, wherein the value that gives an indication
of the likelihood of the first and second operands being equal is an
integer count, the apparatus further comprising:

means for executing the operation in the execution unit using the first
operand instead of the second operand, if the count does not exceed a
predetermined threshold.

23. The apparatus of claim 22, further comprising:

means for decrementing the count if the first and second operands are
determined to be unequal; and

means for incrementing the count if the first and second operands are
determined to be equal.
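
The counting scheme of claims 22 and 23 can be sketched as a small
per-entry confidence counter. The names `CountedEntry`, `THRESHOLD`, and
`MAX_COUNT`, the particular threshold value, and the saturation of the
count at both ends are assumptions; the claims only require a count, a
predetermined threshold, and increment and decrement on equal and unequal
comparisons.

```python
from dataclasses import dataclass

THRESHOLD = 1  # assumed value of the predetermined threshold
MAX_COUNT = 3  # assumed saturation limit (not taken from the claims)


@dataclass
class CountedEntry:
    """Hypothetical operand cache entry carrying a confidence count."""
    operand: int
    count: int = 0

    def should_use_cached(self) -> bool:
        # The cached operand is used only if the count exceeds the threshold;
        # otherwise the operand read from memory is used instead.
        return self.count > THRESHOLD

    def record_comparison(self, actual_operand: int) -> None:
        if actual_operand == self.operand:
            self.count = min(self.count + 1, MAX_COUNT)  # equal: increment
        else:
            self.count = max(self.count - 1, 0)          # unequal: decrement
```

With these assumed values an entry must be confirmed twice before its
operand is used speculatively, and a single mismatch at a count of two
turns speculation off again.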

24. The apparatus of claim 21, wherein the value that gives an indication
of the likelihood of the first and second operands being equal is a history
bit pattern, the apparatus further comprising:

a table including a plurality of entries, each of the entries storing a bit
and being indexed by a respective bit pattern; and

means for executing the operation in the execution unit using the first
operand instead of the second operand, if the bit in the entry of the table
indexed by the history bit pattern is not set.

25. The apparatus of claim 24, further comprising:

a shift register for shifting a logical one bit into the history bit
pattern, if the first and second operands are determined to be equal and
for shifting a logical zero bit into the history bit pattern, if the first
and second operands are determined to be unequal.
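
The history mechanism of claims 24 and 25 can be sketched as a shift
register that indexes a table of single bits. The names `HistoryPredictor`
and `HISTORY_BITS`, the four-bit history length, and in particular the
table contents are assumptions: the claims do not say how the table bits
are chosen, so this sketch simply sets the bit for every pattern whose two
most recent comparisons were both equal.

```python
HISTORY_BITS = 4  # assumed length of the history bit pattern


class HistoryPredictor:
    """Hypothetical per-entry history pattern plus a shared table of bits."""

    def __init__(self) -> None:
        self.history = 0  # most recent outcome is the least significant bit
        # Assumed policy: speculate when the last two comparisons were equal.
        self.table = [(p & 0b11) == 0b11 for p in range(1 << HISTORY_BITS)]

    def should_use_cached(self) -> bool:
        # If the indexed bit is not set, the operand from memory is used.
        return self.table[self.history]

    def record_comparison(self, equal: bool) -> None:
        # Shift in a one for "equal" and a zero for "unequal".
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(equal)) & mask
```

Because the table is indexed by the whole pattern, a different assumed
policy (for example, tolerating one recent mismatch) only changes how the
table is initialised.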

26. The apparatus of claim 20, further comprising:

means for replacing the second operand in the entry of the operand cache
corresponding to the first address with the first operand, if the first and
second operands are not equal.

27. The apparatus of claim 20, wherein the entry of the operand cache
corresponding to the first address stores an operand address, the apparatus
further comprising:

means for retrieving the operand address;

means for comparing the operand address and second address; and

means for replacing the operand address in the entry of the operand cache
corresponding to the first address with the second address if the operand
address and second address are unequal.
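
The entry maintenance of claims 26 and 27 can be sketched together. The
names `TrackedEntry` and `refresh_entry` are hypothetical, and combining
the two updates in one function is a choice made here for brevity; the
claims recite them separately.

```python
from dataclasses import dataclass


@dataclass
class TrackedEntry:
    """Hypothetical entry holding both the operand and where it came from."""
    operand_addr: int
    operand: int


def refresh_entry(entry: TrackedEntry, actual_addr: int, actual_operand: int) -> bool:
    """Bring the entry up to date; return True if the operands differed."""
    mismatch = actual_operand != entry.operand
    if mismatch:
        # Claim 26: replace the stale operand with the one read from memory.
        entry.operand = actual_operand
    if actual_addr != entry.operand_addr:
        # Claim 27: replace the stored operand address with the second address.
        entry.operand_addr = actual_addr
    return mismatch
```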
28. An apparatus for maintaining consistency between an operand cache and a
memory, each entry of the operand cache including an operand address field
capable of storing an operand address in the memory and an operand datum
field capable of storing an operand value, the apparatus comprising:

the operand cache; and

means for writing a particular value occupying a number of bytes into the
operand datum field of each entry of the operand cache whose operand
address field matches a particular address, in response to a write of the
particular value to the memory at the particular address.

29. The apparatus of claim 28, wherein:

the operand address field of each entry of the operand cache stores a
32-bit byte address;

the operand datum field of each entry of the operand cache can store up to
a 4-byte value;

the particular address is a 32-bit byte address;

the particular address matches the operand address field of a particular
entry of the operand cache if and only if the particular address and the
operand address stored in the operand address field of the particular entry
share the same 30 most significant bits; and

the specific bytes of the operand datum field of a matching entry of the
operand cache into which the particular value is written are determined by
the number of bytes and by the two least significant bits of the particular
address.
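
The matching and byte-selection rules of claims 28 and 29 can be sketched
as follows. Several points are assumptions rather than claim language: the
names `SnoopEntry` and `snoop_write`, the treatment of the datum field as
the four byte lanes of the aligned word containing the operand address, and
the truncation of a write that would run past the end of that word.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnoopEntry:
    """Hypothetical entry: a 32-bit byte address and a 4-byte datum field."""
    operand_addr: int
    datum: bytearray  # always 4 bytes; index i is byte lane i of the word


def snoop_write(entries: List[SnoopEntry], write_addr: int, value: bytes) -> int:
    """Mirror a memory write into every matching entry; return the match count."""
    matches = 0
    for entry in entries:
        # Match if and only if the 30 most significant address bits agree.
        if (entry.operand_addr >> 2) == (write_addr >> 2):
            # The two least significant bits select the first byte lane.
            offset = write_addr & 0b11
            # The number of bytes written selects how many lanes change
            # (assumption: bytes beyond the word boundary are ignored here).
            n = min(len(value), 4 - offset)
            entry.datum[offset:offset + n] = value[:n]
            matches += 1
    return matches
```

A two-byte write to an address ending in binary 10, for example, changes
only lanes 2 and 3 of each matching entry and leaves lanes 0 and 1 as they
were.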

30. An apparatus for maintaining consistency between an operand cache and a
memory, each entry of the operand cache including a plurality of operand
address fields and a corresponding plurality of operand datum fields, each
of the operand address fields being capable of storing an operand address
in the memory and each of the operand datum fields being capable of storing
an operand value, the apparatus comprising:

the operand cache; and

means for writing a particular value into each operand datum field of the
operand cache whose corresponding operand address field matches a
particular address, in response to a write of the particular value to the
memory at the particular address.
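
Claim 30 extends the same consistency mechanism to entries that hold
several operands. A minimal sketch, assuming exact address equality as the
match rule and using the hypothetical names `MultiOperandEntry` and
`snoop_write_multi`:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiOperandEntry:
    """Hypothetical entry with parallel address and datum fields."""
    operand_addrs: List[int]
    operand_data: List[int]


def snoop_write_multi(entries: List[MultiOperandEntry],
                      write_addr: int, value: int) -> int:
    """Update every datum field whose address field matches the write.

    Returns the number of datum fields that were updated.
    """
    updated = 0
    for entry in entries:
        for i, addr in enumerate(entry.operand_addrs):
            if addr == write_addr:
                entry.operand_data[i] = value
                updated += 1
    return updated
```

Each address field is checked independently, so one write can update a
single field of an entry while leaving the entry's other operands intact.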

31. An apparatus for executing an instruction stored at a first address in
a memory, execution of the instruction including execution in an execution
unit of a first operation that specifies a first operand to be read from
the memory and execution in the execution unit of a second operation that
specifies a second operand to be read from the memory, the apparatus
comprising:

an operand cache including a plurality of entries, the plurality of entries
including a first entry corresponding to the first address and storing
third and fourth operands;

means for using the first address to access the first entry, for retrieving
the third and fourth operands and for passing the third and fourth operands
to the execution unit; and

means for executing the first and second operations in the execution unit
using the third and fourth operands, respectively, as substitutes for the
first and second operands, respectively.


